Augmented Proximal Policy Optimization for Safe Reinforcement Learning
Authors
Abstract
Safe reinforcement learning considers practical scenarios that maximize the return while satisfying safety constraints. Current algorithms, which suffer from training oscillations or approximation errors, still struggle to update the policy efficiently while satisfying the constraints precisely. In this article, we propose Augmented Proximal Policy Optimization (APPO), which augments the Lagrangian function of the primal constrained problem with a quadratic deviation term. The resulting multiplier-penalty function dampens cost oscillations for stable convergence while remaining equivalent to precise control of the costs. APPO alternately updates the policy and the multiplier to solve the augmented primal-dual problem, and can be easily implemented with any first-order optimizer. We apply our method to diverse safety-constrained tasks, setting a new state of the art against a comprehensive list of safe RL baselines. Extensive experiments verify the merits of our method in easy implementation, stable convergence, and precise cost control.
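The augmented-Lagrangian idea in the abstract — attach a quadratic deviation term to the Lagrangian, then alternate first-order primal updates with multiplier updates — can be illustrated on a toy scalar problem. This is a minimal sketch under assumed forms (the function names, step sizes, and toy objective are all illustrative), not the authors' APPO implementation:

```python
def augmented_lagrangian(f_grad, g, g_grad, x0, rho=10.0, lr=0.01,
                         outer_steps=50, inner_steps=200):
    """Minimize f(x) subject to g(x) <= 0 by alternating a first-order
    primal update on x with a dual update on the multiplier lam.
    The quadratic penalty (weight rho) damps oscillations of the
    constraint value across iterations."""
    x, lam = x0, 0.0
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            # Gradient w.r.t. x of the augmented Lagrangian
            #   L(x, lam) = f(x) + (rho/2) * max(0, g(x) + lam/rho)^2
            mult = max(0.0, lam + rho * g(x))
            x = x - lr * (f_grad(x) + mult * g_grad(x))
        # Dual update on the multiplier, projected to stay nonnegative.
        lam = max(0.0, lam + rho * g(x))
    return x, lam

# Toy problem: minimize x^2 subject to x >= 1, i.e. g(x) = 1 - x <= 0.
# The constrained optimum is x* = 1 with multiplier lam* = 2.
x_opt, lam_opt = augmented_lagrangian(
    f_grad=lambda x: 2.0 * x,
    g=lambda x: 1.0 - x,
    g_grad=lambda x: -1.0,
    x0=0.0,
)
```

Because each primal step uses only gradients, the same loop structure works with any first-order optimizer in place of plain gradient descent.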
Similar resources
Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning
Constrained Markov Decision Process (CMDP) is a natural framework for reinforcement learning tasks with safety constraints, where agents learn a policy that maximizes the long-term reward while satisfying the constraints on the long-term cost. A canonical approach for solving CMDPs is the primal-dual method which updates parameters in primal and dual spaces in turn. Existing methods for CMDPs o...
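The canonical primal-dual iteration this abstract describes — update parameters in the primal and dual spaces in turn — can be written in a few lines for a toy constrained problem. A minimal illustration under assumed step sizes, not code from the cited paper:

```python
def primal_dual(f_grad, g, g_grad, x0, lr_x=0.05, lr_lam=0.05, steps=5000):
    """Solve min f(x) s.t. g(x) <= 0 by gradient descent on the primal
    variable x and projected gradient ascent on the Lagrange
    multiplier lam, updated in turn."""
    x, lam = x0, 0.0
    for _ in range(steps):
        # Primal step: descend on the Lagrangian f(x) + lam * g(x).
        x = x - lr_x * (f_grad(x) + lam * g_grad(x))
        # Dual step: ascend on the constraint value, keeping lam >= 0.
        lam = max(0.0, lam + lr_lam * g(x))
    return x, lam

# Toy problem: minimize x^2 subject to x >= 1 (g(x) = 1 - x <= 0).
# The constrained optimum is x* = 1 with multiplier lam* = 2.
x_pd, lam_pd = primal_dual(
    f_grad=lambda x: 2.0 * x,
    g=lambda x: 1.0 - x,
    g_grad=lambda x: -1.0,
    x0=0.0,
)
```

Without an extra penalty term, the iterates spiral toward the saddle point, which is the oscillation that the quadratic augmentation in the main paper is designed to dampen.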
Safe and Efficient Off-Policy Reinforcement Learning
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of “off-policyness”; and (3) it is efficient as it makes the b...
Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret
Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. However, current lifelong learning methods exhibit non-vanishing regret as the amount of experience increases, and include limitations that can lead to suboptimal or unsafe control...
Safe exploration for reinforcement learning
In this paper we define and address the problem of safe exploration in the context of reinforcement learning. Our notion of safety is concerned with states or transitions that can lead to damage and thus must be avoided. We introduce the concepts of a safety function for determining a state’s safety degree and that of a backup policy that is able to lead the controlled system from a critical st...
Policy Improvement through Safe Reinforcement Learning in High-Risk Tasks
Reinforcement Learning (RL) methods are widely used for dynamic control tasks. In many cases, these are high-risk tasks where the trial-and-error process may select actions whose execution from unsafe states can be catastrophic. In addition, many of these tasks have continuous state and action spaces, making the learning problem harder and unapproachable with conventional RL algorithms. So, whe...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i6.25888